[57] ABSTRACT

A search query is applied to documents in a document
repository wherein the documents are organized into a
hierarchy. A search engine searches the hierarchy to return
documents which match a query term either directly or
indirectly. A specific embodiment of the search engine
organizes the query term into individual subterms and
matches the subterms against documents, returning only
those documents which indirectly match the entire search
query term and directly match at least one of the query
subterms.

INFORMATION RETRIEVAL FROM
HIERARCHICAL COMPOUND DOCUMENTS 

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document 
contains material which is subject to copyright protection. 
The copyright owner has no objection to the xerographic 
reproduction by anyone of the patent document or the patent 
disclosure in exactly the form it appears in the Patent and 
Trademark Office patent file or records, but otherwise 
reserves all copyright rights whatsoever. 

BACKGROUND OF THE INVENTION 

The present invention relates to the field of electronic 
document storage and management. More specifically, one 
embodiment of the invention provides for a system of 
storing compound documents and searching the stored com- 
pound documents. 

Information has recently undergone a transition from a 
scarce commodity to an overabundant commodity. With a 
scarce commodity, efforts are centered on acquiring the 
commodity, whereas with an overabundant commodity, 
efforts are centered on filtering the commodity to make it 
more valuable. The prime example of this phenomenon is 
the explosion of information resulting from the growth of 
the global internetwork of networks known as the "Internet." 
Networks and computers connected to the Internet pass data
using TCP/IP (Transmission Control Protocol/Internet
Protocol), which reliably passes data packets from a source
node to a destination node. A variety of higher level
protocols are used on top of TCP/IP to transport objects of
digital data, the particular protocol depending on the nature
of the objects. For example, e-mail is transported using the
Simple Mail Transfer Protocol (SMTP) and the Post Office
Protocol 3 (POP3), while files are transported using the File
Transfer Protocol (FTP). Hypertext documents and their
associated effects are transported using the Hypertext
Transfer Protocol (HTTP).

When many hypertext documents are linked to other 
hypertext documents, they collectively form a "web" 
structure, which led to the name "World Wide Web" (often 
shortened to "WWW" or "the Web") for the collection of 
hypertext documents that can be transported using HTTP. Of 
course, hyperlinks are not required in a document for it to be 
transported using HTTP. In fact, any object can be trans- 
ported using HTTP, so long as it conforms to the require- 
ments of HTTP. 

In a typical use of HTTP, a browser sends a uniform 
resource locator (URL) to a Web server and the Web server 
returns a Hypertext Markup Language (HTML) document 
for the browser to display. The browser is one example of an 
HTTP client and is so named because it displays the returned 
hypertext document and allows the user an opportunity to 
select and display other hypertext documents referenced in 
the returned document. The Web server is an Internet node 
which returns hypertext documents requested by HTTP 
clients. 

Some Web servers, in addition to serving static 
documents, can return dynamic documents. A static docu- 
ment is a document which exists on a Web server before a 
request for the document is made and for which the Web 
server merely sends out the static document upon request. A
static page URL is typically in the form of
"host.subdomain.domain.TLD/path/file" or the like. That
static page URL refers to a document named "file" which is
found on the path "/path/" on the machine which has the
domain name "host.subdomain.domain.TLD". An actual
domain name, "www.yahoo.com", refers to the machine (or
machines) designated "www" at the domain "yahoo" in the
".com" top-level domain (TLD). By contrast, a dynamic
document is a document which is generated by the Web
server when it receives a particular URL which the server
identifies as a request for a dynamic document.
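
The parts of a static page URL named above can be picked apart with a few lines of code. The following is a minimal sketch only: the function name describe_static_url and the "http://" scheme prefix are assumptions made for illustration and are not part of this disclosure.

```python
from urllib.parse import urlsplit


def describe_static_url(url):
    """Split a static page URL into the parts discussed in the text.

    Hypothetical helper: assumes the URL carries an explicit scheme
    (for example "http://") so that the host name can be located.
    """
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")
    # Everything before the final "/" is the path; the rest is the file.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "machine": host_labels[0],   # e.g. "www"
        "tld": host_labels[-1],      # e.g. "com"
        "path": directory + "/",
        "file": filename,
    }


info = describe_static_url("http://www.yahoo.com/path/file")
```

Here info["machine"] is "www", info["tld"] is "com", info["path"] is "/path/" and info["file"] is "file".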

Many Web servers operate "Web sites" which offer a
collection of linked hypertext documents controlled by a
single person or entity. Since the Web site is controlled by
a single person or entity, the hypertext documents, often
called "Web pages" in this context, have a consistent look
and subject matter. Especially in the case of Web sites put up
by commercial interests selling goods and services, the
hyperlinked documents which form a Web site will have
few, if any, links to pages not controlled by the interest. The
terms "Web site" and "Web page" are often used
interchangeably, but herein a "Web page" refers to a single
hypertext document which forms part of a Web site and
"Web site" refers to a collection of one or more Web pages
which are controlled (i.e., modifiable) by a single entity or
group of entities working in concert to present a site on a
particular topic.

With all the many sites and pages that the many millions
of Internet users might make available through their Web
servers, it is often difficult to find a particular page or
determine where to find information on a particular topic.
There is no "official" listing of what is available, because
anyone can place anything on their Web server and need not
report it to an official agency and the Web changes so
quickly. In the absence of an official "table of contents",
several approaches to indexing the Web have been proposed.

One approach is to index all of the Web documents found
everywhere. While this approach is useful to find a
document on a rarely discussed topic or a reference to a person
with an uncommon first or last name, it often leads to
excessive numbers of "hits." Another approach is to
summarize and categorize Web documents and make the
summaries searchable by category.

In either case, a typical search engine searches for search
terms in each candidate document and returns a list of the
documents which meet the search criteria. Unfortunately, the
information to be gained from the interrelationships of
documents is lost. From the above it is seen that an improved
search system which takes into account the interrelationships
between documents is needed.

SUMMARY OF THE INVENTION 

An improved search system which takes into account
interrelationships among documents by searching across
links is provided by virtue of the present invention. In one
embodiment of the present invention, the documents are
references in a hierarchical document repository used for
keyword and topical searches. A search query is applied to
the hierarchy, which returns documents which directly
match a search query term or indirectly match the search
query term by being a child document in the hierarchy from
a parent document matching all or part of the query term. In
a preferred embodiment, a returned document matches at
least one subterm of the query term directly.

One advantage of the present invention is that it provides
for efficient storage of hierarchical data while allowing
searches to be performed taking into account relationships
among data elements in a hierarchy.

A further understanding of the nature and advantages of
the inventions herein may be realized by reference to the
remaining portions of the specification and the attached
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a client-server system 
having a search engine according to one embodiment of the 
present invention. 

FIG. 2 is a tree graph of documents corresponding to parts 
of the document repository. 

FIG. 3 is a more detailed view of elements of the 
client-server system shown in FIG. 1, showing further 
details of a document repository, a word index and a search 
engine. 

FIGS. 4(a)-(c) are examples of match lists used by the 
search engine shown in FIG. 3. 

FIG. 5 is a screen shot of a browser display of search 
results according to one embodiment of the present inven- 
tion. 

FIG. 6 is a flow chart of an AND operation performed by 
a search engine. 

FIG. 7 is a flow chart of an OR operation performed by 
a search engine. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

The present invention relates to an apparatus for searching 
for selected documents in a document repository containing 
a large number of documents. A search engine according to 
one embodiment of the present invention receives a search
expression and, based on that search expression, searches for
matching documents in the document repository and returns
either the matching documents or a list of references to each
of the matching documents. Where the search expression is
a complex logical function of two or more subterms, the
search engine will return documents which match some of 
the subterms only indirectly. For example, the search expres- 
sion may be an "AND" of three subterms. Instead of only 
returning documents containing all three subterms, the 
search engine will also return documents which only have 
one or two of the subterms, if the remaining subterms are 
found anywhere in documents along the hierarchical path 
from the document to a root node. In some variations, 
documents with only indirect matches for all of the subterms 
are returned, but in the preferred embodiment, a returned 
document must match directly at least one subterm. 
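
The matching rule just described can be illustrated with a short sketch. It is a simplified model only: the PARENT and WORDS tables, the function names and the whole-word comparison are assumptions made for illustration, and the sketch walks the path to the root directly rather than using the match lists described later.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy: document -> parent document (root has None).
PARENT: Dict[str, Optional[str]] = {
    "Recreation": None,
    "Games": "Recreation",
    "Board Games": "Games",
    "Go": "Board Games",
}

# Hypothetical searchable words of each document, lower-cased.
WORDS: Dict[str, Set[str]] = {
    "Recreation": {"recreation"},
    "Games": {"games"},
    "Board Games": {"board", "games"},
    "Go": {"go"},
}


def path_words(doc: str) -> Set[str]:
    """Words found in the ancestors of doc, from its parent up to the root."""
    found: Set[str] = set()
    node = PARENT[doc]
    while node is not None:
        found |= WORDS[node]
        node = PARENT[node]
    return found


def matches_and(doc: str, subterms: List[str], require_direct: bool = True) -> bool:
    """AND of subterms, allowing indirect matches along the path to the root.

    With require_direct=True (the preferred embodiment in the text), at
    least one subterm must appear in the document itself.
    """
    direct = WORDS[doc]
    inherited = path_words(doc)
    if not all(t in direct or t in inherited for t in subterms):
        return False
    return (not require_direct) or any(t in direct for t in subterms)
```

For example, matches_and("Go", ["go", "games"]) is true because "go" matches directly and "games" is found in an ancestor, while matches_and("Go", ["board", "games"]) is false unless require_direct is set to False, since both subterms match only indirectly.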

The present invention is described herein with reference
to a particular type of document; however, it should be
understood that the present invention and the embodiments
described herein are usable with many other types of
documents.

The documents described in the main example herein are 
records in a search database. The search database is orga- 
nized as a hierarchical structure of categories and site 
references. The structure might be automatically generated, 
but in the embodiment known as the Yahoo! search database, 
the categories and site references are placed in appropriate 
locations in the hierarchy by an editorial staff using the 
experience and suggestions from site submitters. 

The categories and site references are collectively referred 
to as the nodes of the structure. Some category nodes are 
parent nodes, in that they point to other category nodes 
(child nodes) representing more specific subcategories of the 
category represented by the parent node. Site nodes are child 
nodes from a category node (although a particular site might 
be listed in multiple categories and be a child node in several 
subtrees). 

Herein, a node might be described as being a parent, child,
ancestor or descendant node of another node. Relative to a
node N, a parent node is the node one level above node N
in the hierarchy, N's child nodes are nodes one level below
node N in the hierarchy, N's ancestor nodes are nodes at any
level above node N, and N's descendant nodes are nodes at
any level below node N. Typically, the hierarchy has a root
node which has no ancestor nodes and has all other nodes as
descendant nodes.

In the embodiment described here, a category node can
have category nodes, site nodes or both as child nodes, but
site nodes do not have child nodes. Not all category nodes
are required to have child nodes, but empty categories are
preferably deleted or hidden to avoid unnecessary clutter.

FIG. 1 shows an example of a client-server system 10 in 
which such a search database is queried. System 10 is shown 
comprising an HTTP client 12 connected to a search server 
14 via Internet 16. Search server 14 is coupled to a document 
repository 20 and a word index 22 and responds to a search 
request 30 with a search result 32. 

In this specific example, HTTP client 12 is a browser, but 
other HTTP clients, such as search back-end processors, 
could be used instead of a browser. Also, it should be 
understood that system 10 could be implemented with 
Internet 16 replaced with an alternate communications chan- 
nel between HTTP client 12 and search server 14. 
Furthermore, it should be understood that while search 
server 14 is an HTTP server, it could handle requests using 
an entirely different protocol, so long as the different pro- 
tocol is understood by HTTP client 12 or its substitute. For 
brevity, only one HTTP client, one request and one response
are shown, but it should be understood that, in practice, many
clients will be accessing search server 14 substantially 
simultaneously, each with one or more search requests. In 
fact, if warranted, the tasks of search server 14 might be 
spread over multiple machines. If the tasks are spread over 
multiple machines, the preferred arrangement is to have the 
multiple machines presented to the clients as a single logical 
machine, to simplify client access. 

In operation, a user at a browser, or other HTTP client, 
sends a request 30 containing a search expression to search 
server 14 where search server 14 parses the search expres- 
sion and, if the search expression is in a valid format, uses 
the search expression to find documents in document
repository 20 which match the search expression. Search server 14
responds with either a list of matching documents or the 
documents themselves. Word index 22 is used to speed up 
the search for documents in document repository 20. 

FIG. 2 shows how the documents in document repository
20 are logically arranged. In this example, documents are
elements of a search database which is used to locate WWW
sites of interest. Each document represents a topical category
or a site, and each document is shown as a record 38 in a
hierarchical structure being in parent-child relation with
other records. Each record 38 is shown with a document
number 40 and content 46. In the case of a document which
is a category, content 46 is the title of the category and other
text (not shown), such as hidden keywords, synonyms,
descriptions, etc., while the content of documents which
refer to sites includes a title, a URL, a description, hidden
keywords, synonyms, etc. Of course, some of these elements
can be blank, where appropriate or desired. As explained
above, in the Yahoo! search database, the documents are
positioned in the hierarchical structure by an editorial staff.
In a typical procedure, a site promoter will submit site
information to the editorial staff, such as a site title, site
URL, proposed location in the hierarchy, description, etc.
The editorial staff then evaluates the submission, changes
the suggested location if a more appropriate location exists,
adds cross links as needed and, in some cases, adds hidden
keywords, synonyms and/or a document importance
weighting value.

Links between records are shown in FIG. 2, with each link
connecting a more general topic (parent node) with a more
specific topic or a site reference (child node). For example,
document #5 is a site reference to a WWW site relating to
"Go", which is a board game and therefore a subtopic of the
"Board Games" topic, which is a subtopic of the "Games"
topic, which is a subtopic of the "Recreation" topic, and so
on.

While some site references, such as documents #5 and #6,
are nodes off of a leaf category (i.e., one with no child
category nodes), other documents, such as document #21, are
nodes off of a nonleaf category. Collectively, the links
define trees and subtrees which, as explained below, are
numbered so that the documents in any subtree are
consecutively numbered following the document number of the
document at the top of the subtree.

Referring now to FIG. 3, a different view of the
information shown in FIG. 2 is presented. FIG. 3 presents the
information as it is likely to be stored, with records 38 in a
data table corresponding to nodes of the tree structure in
FIG. 2. It should be understood that the data structures of
FIG. 3 represent one of many possible data arrangements.
Only a few records 38 are shown, but in practice many
millions of records might be present.

The fields shown for records 38 are a document number
40, a subtree pointer 42 to a last node in a subtree (which can
either be stored, generated on the fly as needed or obtained
from a memory array), a parent pointer 44 to a parent node,
the text of the document represented by the record (shown
here as a title 46 and a description 47), an optional set 48 of
one or more keywords associated with the document, and a
boolean indication 50 of whether a record is for a category
or a site. As with the view of FIG. 2, some nodes point to
WWW sites and other nodes represent categories in a
hierarchical topical category structure in which site elements
are associated with one or more category elements. It should
be apparent from this description that, while the example is
a tree structure of topics and site references, the system
described herein can search more complex documents.

For category nodes, record 38 includes a title 46, a
description 47 of the category, and possibly a set of hidden
keywords 48. For site nodes, the record includes a title, a
description of the site (possibly blank), and a URL pointing
to the site/page referenced. Together, document number 40,
subtree pointer 42 and parent pointer 44 describe the linkage
between records. For example, document #2 has "8" as its
subtree pointer, indicating that all the documents numbered
from 3 (the document number plus one) to 8 (the subtree
pointer value) are in the subtree below document 2, and "1"
as its parent pointer, indicating that document 1 is the parent
document of document 2. The other fields of the record 38
for document #2 indicate that its content is "Games", it has
no keywords listed and it is a category (as opposed to a site
reference). The specification of an entire subtree using just
the last document number in the subtree is possible because
of the particular assignment order of document numbers.

Document repository 20 includes the necessary
processing logic to return documents requested by document
number and either document repository 20 or search engine 36
contains processing logic to search a record for an instance
of a field value which matches a query term.

Referring now to word index 22 shown in FIG. 3, a small
subset of the contents of word index 22 is there shown. Word
index 22 is organized as a plurality of records, with one
record per word occurring in the documents of document
repository 20, sorted in alphabetical order by word. Each
record 52 in word index 22 is shown with a tag identifying
the word, followed by a list of document numbers. These
document numbers represent the list of documents
containing the word. Where a word is so common as to be a search
term of limited usefulness, such as the word "the", its record
does not list all of the documents containing the word, but
contains just an indication that the word should be ignored.

In the preferred embodiment, search engine 36 uses a
document profile array 49 to improve search speed.
Typically, array 49 is stored in memory for quick access.
Array 49 has one record per document and each record
includes fields for a document number, a document
repository pointer, a time stamp, a child record range and an
importance weighting value. The document numbers
correspond to document numbers of document repository 20 and
the document repository pointers correspond to physical
disk locations of the documents in document repository 20,
so that array 49 can be used to perform some operations on
documents which don't require an access of document
repository 20 itself. The time stamp identifies when that
document was last modified. The range of children field
indicates which records are below the instant document in
the hierarchy, so that search engine 36 can quickly build a
match list without having to refer to document repository 20
too often.

The importance weighting value is a value set
automatically, or by an editorial staff, to indicate how
valuable and/or relevant a particular category or site is
relative to other categories and sites. The importance
weighting value of a record might be adjusted based on
external events or the significance of a site. For example, a
site related to a particular group which is currently in the
news might be given a higher weighting, or a site might be
given a higher weighting if the editorial staff determines that
the site is popular or well-designed. Although the category
records in array 49 shown in FIG. 3 do not have weighted
categories, weighted categories might be useful. For
example, during boating season, the weighting for document
#9 (category "boating") might be increased. A record's
weighting comes into play when multiple documents are
being displayed as a search result, as the documents
are displayed in order by their weighting values. Of course,
other weighting factors, as described below, might override
the importance weighting or be combined with it to form an
overall weighting.

The use of document repository 20 and word index 22 will
now be described with reference to an example. In this
example, a user is searching for documents and presents a
search request with a query string: "The game of Go". Search
engine 36 looks up each of the terms in word index 22.
Because they are so common, "the" and "of" are either
ignored by search engine 36 or word index 22 returns
instructions to ignore those words, as described above.
Search engine 36 then reads the document lists for "game"
and "go", generates a match list for each term and applies an
"AND" operator to the match lists as described below in
connection with FIG. 6.

A match list is a list of all the documents that contain the
list's match tag either directly or indirectly. A match tag is
a word or other search term or search element, depending on
what the query term is. FIG. 2 illustrates why indirect
matches are important. The example used throughout this
description is a search for categories and sites related to the
game of "Go", a well-known board game using black and
white markers. Since the name of the game happens to be the
same as a common word in the English language, searching
for "go" would result in too many unrelated matches.
However, as can be seen, searching for "go" and "game" in
the same document would result in no matches. Therefore,
each document needs to be searched as if it contained all of
the searchable elements (the searchable elements are words
in this case) of all of its ancestor documents. Of course, the
content of all ancestor documents can be inserted into each
of the descendant documents in its subtree, but with large
trees, this approach is wasteful and impractical.

Referring again to FIGS. 2-3, each item on a match list
refers to a single document, in the case of a direct match, or
a range of documents, in the case of an indirect match. FIG.
4 shows several examples of match lists. The first, FIG. 4(a),
is a match list 60 which corresponds to the particular
documents shown in FIGS. 2-3. Match list 60 contains three
items, or match records. The first is a direct match record
indicating that document #3 matches the match tag and the
second is an indirect match record indicating that documents
#4 through #8 indirectly match the match tag.

In this example, since the match tags are text, "matching"
occurs when the document contains the match tag as a string
or substring in the document's content. In some cases, the
only form of substrings which are recognized are "right hand
wildcard" substrings, which are of the form of "word*". As
can be seen from FIGS. 2-3, document #3 does indeed
directly match the match tag, "board", of match list 60.
Documents #4 through #8 do not contain the word "board"
directly, but they are child documents/nodes from a
document/node which does contain the word. Because they
are children from a parent which contains the word and the
children do not contain the word, they are therefore indirect
matches.

Match list 60 has a third match record, "null", which
simply indicates the end of the match list. The use of a null
item at the end of a list is a well-known computing technique
and many other list handling techniques can be used in place
of the particular one described here.

The direct match records in a match list come from word
index 22. The indirect match records are obtained by
examining the document record in document repository 20 or a
document summary record in array 49 for each direct match.
If a direct match document record indicates that the
document has a subtree, an indirect match record is created for
the document range in the subtree. Where a document in the
subtree is also a direct match, it is excluded from the indirect
match range (which may result in a range being split over
two indirect match records). As each direct match is added
to a match list, the match list is checked to determine if an
indirect match (a range) already on the match list overlaps
the direct match. This occurs where the match term appears
in both an ancestor document and a descendant document.
Because ancestor documents have lower document numbers
than their descendants, the ancestor document is processed
first. A direct match record is created for the ancestor
document and then an indirect match record is created for
the group of descendant documents below the direct match
document. When the descendant direct match document is
processed, it too will be listed in a direct match record, and
therefore should not be included in the range of an indirect
match. To keep each indirect match associated with only one
range of documents, the indirect match record is split into
two indirect match records, one on each side of, and
excluding, the descendant direct match document. Of
course, if the descendant document is at one border of the
indirect range, only one new indirect match record will be
created. That one new indirect match record would simply
be the indirect range reduced by one document number at the
border.

This is illustrated in FIGS. 4(b)-(c). A match record 62,
shown in FIG. 4(b), has a direct match record for document
#12 and an indirect match record for documents #13 to #17.
If documents #15 and #16 were changed such that they
contained the match tag directly, the subtree would be
represented by two indirect matches, one on each side of the
direct match. Of course, if there were no matches on one side
of the direct match, only one indirect match record would be
necessary.

It should be noted that one of the document ranges,
"17-17", contains only one document. This is to
distinguish indirect match records from direct match records. Of
course, alternative arrangements can be used. For example,
in a simple case, each match record could comprise just a
flag and a document number, where the flag indicates
whether the document number refers to a direct or indirect
match. For direct matches, the single number would be the
number of the matching document and for indirect matches,
the number is the number of the last document of the range.
This is not ambiguous where the first number of the range is
the number following the number of the immediately
preceding direct match record. This will be the case unless
document numbers are missing, because the documents
were ordered so as to have this property.

Referring back to FIG. 3, in some systems, depending on
how often documents are changed and how often search
terms are used, both direct matches and indirect matches
might be precalculated and stored in records 52. Otherwise,
they are created on the fly as needed. If that is done, search
engine 36 need not access document repository 20 unless a
search query requires an examination of the position of
words in the documents or other field information which
cannot be obtained from word index 22. An advantage of
word index 22 is that the match list is in order for
matching given a search request.

Once a match list is obtained or generated by search
engine 36, it returns an output list 31. Output list 31 can be
the listed documents themselves, or just the document
numbers. If output list 31 is the documents themselves, and
is appropriately formatted, output list 31 might be the
result 32 which is sent to browser 12 (see FIG. 1).

FIG. 5 shows an example of a display 53 of a search result
which might result from the query string: "The game of go".
On display 53, matching category documents 54 are shown
above, and separated from, matching site documents 58,
shown with their paths 56 through the category tree. FIG. 5
represents an actual search through the category structure
and site listings of Yahoo!, Inc., the assignee of the present
application. For clarity, not all the matches shown in FIG. 5
are represented in other figures and not all of the actual 177
site matches found are shown in FIG. 5.

Several ease-of-use features of display 53 should be
noted. Each of the "hits" or matches (54, 58) is shown with
a concatenation of titles of categories defining a path to the
match. This provides the user with context. Examples of this
are shown by matching category documents 54 shown in
FIG. 5. To further improve readability, matching documents
which are children nodes from a matching document are not
shown. If they were, all of the records under matching
category documents 54 would have been shown.

Now that match lists, with direct and indirect match
records, and their generation have been described, the
application of operations, such as "AND", "OR", "ADD" and
"SUBTRACT" on match lists to form other match lists will
now be described. These operations are useful where a
search engine needs to generate a match list for a complex
search expression which contains a plurality of search
subterms where a match list is available for each of the
search subterms. Continuing the example described above,
the search engine might combine the match lists for the
search terms "go" and "game" using an AND operator to
arrive at a match list (or document list) for the search
expression "go AND game".
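
The generation of a match list from direct matches and subtree pointers, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout of a match record, the function name build_match_list and the subtree_last mapping (document number to the last document number in its subtree) are hypothetical. For clarity the sketch walks every document number in the affected range, rather than splitting existing ranges as each direct match is added.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: ("direct", n, n) or ("indirect", first, last).
MatchRecord = Tuple[str, int, int]


def build_match_list(direct_docs: Iterable[int],
                     subtree_last: Dict[int, int]) -> List[MatchRecord]:
    """Build a match list for one match tag.

    direct_docs  -- document numbers listed in the word index for the tag
    subtree_last -- for each direct match, the last document number in its
                    subtree (its own number if it has no descendants)
    """
    direct = set(direct_docs)
    records: List[MatchRecord] = []
    covered_until = 0   # highest document number inside a matching subtree
    run_start = None    # first document of the indirect range being built
    last_doc = max((subtree_last[d] for d in direct), default=0)

    for n in range(min(direct, default=1), last_doc + 1):
        if n in direct:
            # A direct match ends any open indirect range just before it.
            if run_start is not None:
                records.append(("indirect", run_start, n - 1))
                run_start = None
            records.append(("direct", n, n))
            covered_until = max(covered_until, subtree_last[n])
        elif n <= covered_until:
            # Descendant of a direct match: extend or open an indirect range.
            if run_start is None:
                run_start = n
        elif run_start is not None:
            # Left every matching subtree: close the open range.
            records.append(("indirect", run_start, n - 1))
            run_start = None

    if run_start is not None:
        records.append(("indirect", run_start, last_doc))
    return records


# FIG. 4(b): document #12 matches directly; its subtree runs to #17.
fig_4b = build_match_list([12], {12: 17})
# FIG. 4(c): documents #15 and #16 now also match directly.
fig_4c = build_match_list([12, 15, 16], {12: 17, 15: 15, 16: 16})
```

With these inputs, fig_4b holds a direct record for #12 followed by one indirect record for #13 to #17, and fig_4c holds direct #12, indirect #13 to #14, direct #15, direct #16 and the single-document indirect range #17 to #17.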

FIG. 6 is a flow chart of a process of "AND"ing two or
more match lists to generate a new match list. As will be
apparent, the resulting match list can then be used to
generate search results or can be used as an input to
subsequent logical operations on match lists. If subsequent
logical operations are not going to be done, the output could
simply be a list of documents. In FIG. 6, the steps are labeled
S1, S2, and so on, generally representing the order of
execution of the steps. As will be apparent from reading this
description, other arrangements of the steps may perform
substantially the same function to achieve substantially the
same results.

The need for "AND"ing two or more match lists might
come about where a search string contains an expression of
the form "expression_A AND expression_B AND . . . ".
One match list is obtained for "expression_A" indicating
the documents that contain that subterm, another for
"expression_B", and so on. The resulting match list is a list
of all the documents which contain all of the "AND"ed
expressions and directly contain at least one of the search
subterms. It should be apparent that other variations of these
requirements can be handled by modifications of this
process which should be apparent after reading this description.

In broad terms, the process described in FIG. 6 is an
efficient process for scanning a plurality of match lists to find
which documents are found in all of the match lists and
found in at least one direct match record. To do this, the
process involves first locating a direct match record in one
match list and then checking all other match lists to determine
if the document is found on those lists. When one
match list is found not to have the document on it, a
document cursor is incremented to the next document in the
match list.
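
As a minimal sketch of the requirement stated above, and not of
the cursor-rotation procedure of FIG. 6 itself, the following
takes each direct match found in any list and checks whether
every other list contains that document, either directly or
within a group range. The Match layout and the names covers and
and_lists are assumptions made for illustration; each match list
is assumed to be sorted with non-overlapping ranges.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical match record: a range of document numbers."""
    start: int
    end: int
    direct: bool  # True for a direct match, False for a group match


def covers(match_list: List[Match], doc: int) -> bool:
    """True if `doc` is matched, directly or within a group, by the list."""
    ends = [rec.end for rec in match_list]
    i = bisect_left(ends, doc)  # first record whose range ends at or after doc
    return i < len(match_list) and match_list[i].start <= doc


def and_lists(lists: List[List[Match]]) -> List[int]:
    """Document numbers found in every list and directly in at least one."""
    if not lists:
        return []
    # Only documents with a direct match somewhere can satisfy the AND.
    candidates = sorted({rec.start for ml in lists for rec in ml if rec.direct})
    return [doc for doc in candidates if all(covers(ml, doc) for ml in lists)]


go = [Match(1, 1, True), Match(5, 9, False), Match(12, 12, True), Match(20, 20, True)]
game = [Match(1, 1, True), Match(7, 7, True), Match(8, 9, False), Match(10, 14, False)]
result = and_lists([go, game])
```

Here document 8 lies in a group range of both lists but has no
direct match, so it is not collected, and document 20 is absent
from the second list; the sketch returns documents 1, 7 and 12.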

Referring again to FIG. 6, process variables are initialized
at step S1. A loop counter, LOOP_CNT, is initialized to
zero. The use of the loop counter is explained below. In
addition, a document cursor (D_CUR), which points to
documents in the match lists, is set equal to one; a collection
counter (COLL_CNT), which counts the number of matches
found, is set to zero; and a list pointer (L_PTR), which
points to one of the match lists, is set to point to one of the
match lists. L_PTR may, but need not, be pointed to the
match list for the first listed subterm being "AND"ed. The
match list pointed to by L_PTR is referred to herein as the
"current match list" or the "current list".

At step S2, the current match record is obtained from the
current list. This is referred to herein as the "current match
record". The current match record is the match record in the
current list which has the lowest document number greater
than or equal to D_CUR. If the current match list is empty,
the process simply ends, because no documents will be
found. The first time through step S2, D_CUR will be 1, so
the current match record will be the first record in the current
match list.
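
The lookup of step S2 can be illustrated with a binary search
over a sorted match list. This is a sketch under stated
assumptions: the Match layout and the name current_record are
hypothetical, ranges are assumed not to overlap, and a group
record that encloses D_CUR is treated as satisfying the "greater
than or equal to" test so that a later range check can succeed.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """Hypothetical match record: a range of document numbers."""
    start: int
    end: int
    direct: bool  # True for a direct match, False for a group match


def current_record(match_list: List[Match], d_cur: int) -> Optional[Match]:
    """Return the first record reaching document d_cur or beyond, else None."""
    ends = [rec.end for rec in match_list]
    i = bisect_left(ends, d_cur)  # first record whose end is >= d_cur
    return match_list[i] if i < len(match_list) else None


ml = [Match(3, 3, True), Match(5, 9, False), Match(12, 12, True)]
first = current_record(ml, 1)       # D_CUR == 1 selects the first record
enclosing = current_record(ml, 7)   # group record 5..9 encloses document 7
past_end = current_record(ml, 13)   # nothing left: the process would end
```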

In step S3, the current match record is checked to determine
if it is a direct match or a group match. As should be
apparent from the description of FIG. 4, it is a simple matter
to determine if a match record is a direct match or not, 
because indirect, or group, matches are expressed as ranges 
of one or more document numbers. If the current match 
record is a direct match, that satisfies the requirement that 
there be at least one direct match for the document repre- 
sented by the current match record. Prior to the first direct 
match being found, COLL_CNT is zero, indicating that the 
process has not yet begun "collecting" a document from the 
match lists. If COLL_CNT is zero at step S4 and a direct 
match record is found, COLL_CNT is set to 1 (S5) to 
indicate that a direct match has been found. 

Once the first direct match is found, LOOP_CNT is reset
to zero at step S6 (LOOP_CNT is used to prevent infinite
loops which might otherwise occur in some situations; its
use is explained below) and L_PTR is rotated to point to a
next match list (S7). Following that, the process loops back
to step S2 with the next match list being the current match
list. At step S2, a current match record is found in the now
current match list. As described above, the record found is
the one with the lowest document number greater than or
equal to D_CUR and, if none is found, the process terminates.

If COLL_CNT is nonzero at S4, indicating a state of 
document collection, D_CUR is checked (S8) against the
document number of the current match record. If they are the 
same, that indicates that the document being collected from 
the prior match list is the same as for the now current match 
list. If that is the case, COLL_CNT is incremented (S9) to 
indicate that another match has been found. If less than all 
of the N match lists have been processed, COLL_CNT will 
be less than N, so the process continues at steps S6/S7 where 
the next match list is made the current match list. This may 
continue until COLL_CNT is equal to N. 

When COLL_CNT reaches N, it means that the document
number equal to D_CUR was found in all N of the
match lists and therefore is a document number which
should be in the output match list. Consequently, the current
match record is output (S10) and the process continues at
steps S6/S7 (although the process could also continue by
looping back to step S2 without changing the current match
list). At step S10, COLL_CNT is reset to zero for the next
cycle of document number searching.

If, at step S8, the document number of the current match 
record is not equal to D_CUR, it is because the current 
match list did not have a match record with a document 
number equal to D_CUR and a greater document number 
was chosen. In that case, a current document is still being 
collected, but it is the new, greater document number. 
D_CUR is set to that new document number (S11). To keep
track of how many match lists have this new document
number, COLL_CNT is reset to one (S5) and the process
continues as described above. 

If, at step S3, the current match record is a group match 
instead of a direct match, the processing of the record 
depends on the state of the process, i.e., whether or not a 
document is being "collected". This is determined by checking
COLL_CNT (S12). If COLL_CNT is nonzero, a document
is being collected, in which case the current match
record is compared to D_CUR (S13). If D_CUR is within 
the range of the current match record (which must be a group 
record to get to this step), then COLL_CNT is incremented 
(S9) and the next list is checked, as described above. 

If, at step S12, COLL_CNT is zero, the process continues
at step S14. Also, if at step S13, D_CUR is not within the
range of the match record, the process continues at step S14
after setting COLL_CNT to zero, to indicate that no document
is being collected. At step S14, LOOP_CNT is incremented
and compared to N (S15). If LOOP_CNT is not
equal to N, the process continues with the next list at step S7.
If LOOP_CNT is equal to N, it is an indication that all N
lists were examined and a match was found in each, but none
of the matches were direct matches; otherwise COLL_CNT
would be nonzero.

If LOOP_CNT is equal to N, it means that a group
(indirect) match record enclosing D_CUR was found in
each of the match lists and therefore no direct match is
present for D_CUR. Each of the groups enclosing D_CUR
is examined to find the group with the lowest ending
document number. Alternatively the search engine might
just keep track of the lowest ending document number as
each match list is examined. D_CUR is set to one greater
than the lowest ending document number (S16) and the
search for documents continues at step S6, where LOOP_CNT
is set to zero. Step S6 is positioned to reset LOOP_CNT
when a direct match is found, an output record is
output or LOOP_CNT reaches N and an infinite loop is
avoided by moving D_CUR past the end of a current group.
In the preferred embodiment, at least one direct match is
required. However, in an embodiment where a direct match
is not required, the process might output a match record
when LOOP_CNT reaches N.

Following the process to its conclusion, when the end of
a current match list is reached when passing through step S2,
all the records for documents meeting the requirements
of the AND operation would have been output in the passes
through step S10. Alternatively, if no further logical operations
are to be done, the output could just be a listing of the
document numbers of matching documents.

Referring now to FIG. 7, a process for generating an
output list of documents which match a search expression of
the form "A OR B OR . . ." from match lists for the
subterms A, B, etc. is there shown. In broad terms, this
process involves parsing the search expression into its
subterms and identifying a match list for each subterm, then
combining the match lists into an output list where each
document on the output list contains at least one of the
subterms.

In the preferred embodiment, an additional requirement is
imposed that each document on the output list have at least
one direct match, so there will be no indirect matches, as a
document meeting the additional requirement will necessarily
directly match the OR expression. In the preferred
embodiment, the output list is a list of direct matches each
having an associated match count. A match count indicates
how many of the OR subterms are matched, directly or
indirectly, and therefore is an indication of relative relevance
of a particular document.

In the flow chart of FIG. 7, the steps of the process are
labelled S30, S31, etc., and are executed in numerical order
except where indicated. The process begins at step S30,
where the subterms are extracted from the search expression
and the match list counter, N, is set equal to the number of
subterms. At step S31, one match list is generated for each
subterm, or the lists are retrieved if they are preexisting lists.
At step S32, one cursor is initialized for each match list
with the cursor pointing to the first document in its associated
list. At step S33, the first document from each list is
added to an N-member heap.

Next, the heap contents are ordered by document number
(S34). In the preferred embodiment, where the heap contains
a direct match for a particular document number and an
indirect match with a range beginning at that same document
number, the direct matches are ordered before the indirect
matches. If the heap has more than one indirect match with
the same starting document number, they are sorted by their
ending document number.

Once the heap is sorted, the top heap item is removed
from the heap (S35). If the top heap item is a direct match
item and the document number of that direct match item is
not already in the output list, it is added to the output list
(S36) and the process continues by adding another item to
the heap from the match list of the removed item (S37).
If the match list has no more items, no item is added to
the heap. Eventually, the heap will empty out. If at step S38,
the heap is empty, the process is complete and terminates.
Otherwise, the process loops back to step S34, where the
heap is again ordered.

If the removed item is a direct match record with a
document number of a document already on the output list,
a match count for that document number is incremented
(S39) and the process continues at step S34, as described
above. If the removed item is an indirect match record, it is
not placed on the output list, but the match count is incremented
(S39) for each document which is within the document
range of the indirect match record and the process
continues at step S34. The indirect match is not added to the
output list because any documents in the document range
for that indirect match which meet the requirement of having
at least one direct match will be on the output list.
Since the match records are taken from the match lists
in order when they are placed on the heap, the items on
the heap are taken off in order and direct matches are
taken before indirect matches which start at the same
number.

When the heap is empty, the output list will contain all of
the documents which match the OR criteria. All of the output
list entries will be direct matches and will have an associated
match count. If the requirement that each match contain at
least one direct match is not imposed, the output list
will be in the form of a match list suitable for
processing. The match count can be used, alone or in
combination with importance weighting, to order documents
according to relevance.

Turning now to the "ADD" and "SUBTRACT"
operations, these are much simpler. For "ADD" operations,
the document numbers to be added to a list are simply
inserted. Of course, if a direct match is to be added to a list
containing an indirect, group match enclosing the document
number of the direct match, the group match record is split
as described above. For "SUBTRACT" operations, match
records are simply deleted from the match list. If a document
number is to be subtracted where the document is within a
range of a match record, the group match record is
split as described above.

The above description is illustrative and not restrictive.
Many variations of the invention will become apparent to
those of skill in the art upon review of this disclosure. For
example, the hierarchical structure of documents might be a
web of documents on the Internet instead of the hierarchical
search structure described above. The scope of the invention
should, therefore, be determined not with reference to the
above description, but instead should be determined with
reference to the appended claims along with their full scope
of equivalents.

What is claimed is:

1. A method of searching for documents stored in a
document repository, wherein documents contain searchable
elements and are organized into a document hierarchy, the
method comprising the steps of:

providing a search expression to a search engine, wherein
the search expression is a logical function describing a
set of searchable elements;

searching for direct matches or indirect matches, wherein
a direct match is a document which matches the search
expression and an indirect match is a document which
only matches the search expression when contents of
the indirectly matching document are combined with
contents of the indirectly matching document's ancestor
documents in the hierarchy;

generating a list of at least one match from the results of
the step of searching, where a match over multiple
documents is expressed as a path in the hierarchy which
links the multiple documents; and

outputting the list as a search result.

2. The method of claim 1, wherein the searchable ele- 
ments are words and documents and comprise at least some 
text. 

3. The method of claim 1, wherein the step of searching 
comprises a step of searching for components of the search 
expression in an element index. 

4. The method of claim 1, further comprising a step of 
assigning a document number to each document in a hier- 
archical tree such that the document numbers within any 
branch of the hierarchical tree are consecutive. 

5. The method of claim 1, wherein the search expression 
is a formula comprising operands and operators, wherein the 
operands comprise specified searchable elements or wild 
cards and wherein the operators comprise AND, OR, ADD 
or MINUS. 
6. The method of claim 1, wherein a document is a string
representing a specific topic and the hierarchy is a hierarchy 
of topics. 

7. A method of efficiently storing and searching hierar- 
chical data, comprising the steps of: 

organizing data elements into a hierarchy, wherein each 
data element has a position in the hierarchy and has 
ancestor data elements above the position or descen- 
dant data elements below the position or both; 

assigning a data element number to each data element 
such that the data element number of a data element is 
greater than a data element number of any ancestor data 
element and is less than a data element number of any 
other data element which is not a descendant of the 
ancestor data element and has a data element number 
greater than the ancestor data element number; and

applying a search expression to the hierarchy to identify 
data elements which match the search expression either 
directly or indirectly, wherein the search expression is 
matched directly when content of the data element 
alone matches the search expression and is matched 
indirectly when the data element does not match 
directly, but the content of the data element and at least 
one ancestor data element together match the search 
expression. 

***** 